\documentclass{llncs}

\usepackage{boxedminipage}
\usepackage{setspace}


\newcommand{\bequ}{\begin{quote}}
\newcommand{\enqu}{\end{quote}}
\newcommand{\begit}{\begin{itemize}}
\newcommand{\enit}{\end{itemize}}
\newcommand{\LBNF}{LBNF}


% commands from LBNF documentation
\newcommand{\emptyP}{\mbox{$\epsilon$}}
\newcommand{\terminal}[1]{\mbox{{\texttt {#1}}}}
\newcommand{\nonterminal}[1]{\mbox{$\langle \mbox{{\sl #1 }} \! \rangle$}}
\newcommand{\arrow}{\mbox{::=}}
\newcommand{\delimit}{\mbox{$|$}}
\newcommand{\reserved}[1]{\mbox{{\texttt {#1}}}}
\newcommand{\literal}[1]{\mbox{{\texttt {#1}}}}
\newcommand{\symb}[1]{\mbox{{\texttt {#1}}}}

\title{{\Large \bf BNF Converter: One Grammar, Multiple Front Ends}}

\author{Michael Pellauer, Markus Forsberg, and Aarne Ranta \\
  Department of Computing Science \\
  Chalmers University of Technology and the University of Gothenburg \\
  SE-412 96 Gothenburg, Sweden\\
  \{pellauer,markus,aarne\}@cs.chalmers.se}

\date{\today}

\begin{document}

\maketitle


\begin{abstract}
With the advent of multi-phase compilers and advances in language design the focus of the compiler front end has begun to shift from parsing to the definition of a complex abstract-syntax-tree data structure. BNF Converter is a compiler construction tool that uses a Labelled BNF grammar as the single source of definition for the abstract syntax, lexer, parser and pretty printer. The added layer of abstraction allows it to generate front ends in Haskell, Java, C or C++. This results in new possibilities for the user, and allows us to make observations about constructing a compiler across different implementation languages.
\end{abstract}

\subsubsection*{Keywords}

\textit{compiler construction, parser generator, grammar, labelled BNF, 
        abstract syntax, pretty printer, document automation}


\section{Introduction}

Programming languages of the past often contained features that seem awkward today. For example, Aycock \cite{horspool} reports that in PL/I it was possible to write statements such as this:
\begin{center}
{\tt IF IF = THEN THEN THEN = IF}
\end{center}
This is legal code because after an IF-statement an identifier was expected, and the strings ``IF'' and ``THEN'' were also legal as identifier names. 

As programming languages became more and more widespread, language developers began to realize
that these features not only created numerous problems implementing parsers,
but were rarely used and often led to confusion when they were. Because of this programming languages have become more and more regular over time.

Today, no Java programmer would expect the ability to use \texttt{if} or any other language keyword as a variable name. Other restrictions, such as identifier names not starting with numbers, have become so internalized that today most programmers follow them without consciously realizing that they exist.

One consequence of these early deviant features is that the focus of past compiler construction tools was on the parser and lexer, which often had to be quite complex. Tools in the YACC tradition are based on the notion of semantic actions. These semantic actions give the parser the power to deal with aberrant features\footnote{Semantic actions also allow documentation writers for these tools to do tricks like implementing a calculator directly in the parser. However, as Appel \cite{AppelJ} points out, using semantic actions for anything more than this is difficult to maintain, and constrains the compiler to analyze the program in the order in which it is parsed.}, but today the ability to execute arbitrary code during parsing is not often needed. Indeed, in modern multi-phase compilers the parser generally does nothing more than construct an abstract syntax tree and pass it on to the next phase.

Today this process of constructing and transforming the abstract syntax tree is by far the focus of the compiler front end. The BNF Converter is a compiler construction tool based on the idea that from a single source grammar it is possible to extract both an abstract syntax tree definition (including a traversal function) and a concrete syntax (including lexer, parser, and pretty printer).

This decoupling of the grammar description from the implementation language  continues the tradition of Andrew Appel \cite{AppelC, AppelJ, appel}, whose textbooks apply the same compiler methodology across three very different programming languages. Similarly, the BNF Converter as of version 2.0 is able to use the same methodology to take a single grammar and generate a compiler front end in Haskell, Java, C++, and C.

\subsection{The BNF Converter Approach}
Compiler construction courses often teach students with tools in the tradition of Lex and YACC. This requires learning two separate configuration syntaxes, as well as getting the lexer and the parser to communicate correctly, which novices often find tricky. After
taking all of the time to implement both the lexer and the parser the user must  hand-write data types to represent their abstract syntax tree, create some kind of test bench to verify that the parser is performing correctly, and document the language so that it may be used by others.

The BNF Converter takes a different approach to the task of front-end
generation. The user specifies the language grammar in an enhanced
version of BNF called Labelled BNF. This grammar gives the
tool all the information it needs to generate an abstract syntax tree, traversal template, pretty printer, and test bench. Additionally while it may be often hard to write correct code for tools like YACC, it is straightforward to {\em generate} correct code for them. Hence the BNF Converter also generates specifications for a lexer and parser.

This approach offers many advantages. First of all, the front end is defined in a
single source file based on the well-known and easy to read Backus Naur Form. This increases maintainability. Secondly this higher level of abstraction makes it easy for our tool to check the grammar for problems, rather than attempting to check code written directly in an implementation language like C.

Finally, this approach decouples the language-definition grammar from the implementation-language syntax, allowing the programmer to
experiment with implementing the same methodology over multiple languages. This opens up new possibilities, such as using a server application written in C++ to pretty-print output that will be parsed by a Java application running on a PDA. It also gives the opportunity to create a prototype language implementation in Haskell, then switch to C for development once the language definition has been finalized.

This paper gives an overview of the LBNF grammar formalism. We then compare the methodology the BNF Converter uses to generate code in Haskell, Java, C++, and C, highlighting the most significant differences of implementing a compiler in these different implementation languages. Finally, we conclude with a discussion of our experiences using the tool in education and language prototyping.

\subsection{Note on Language Describability}
Overall, the BNF Converter gives up the power of semantic actions in favor of a declarative grammar. Experienced language implementors may feel that this restricts the tool to use on toy languages. However, the Labelled BNF formalism introduces new features that increase its usefulness and expressibility.

Additionally, it is sometimes straightforward to move from a language that is nearly describable by a grammar to one that is completely describable. For example, the maintainers of C have long known that they can describe the language through a completely ideal grammar with the removal of a single production via preprocessing \cite{Kern88}. Other features, such as Haskell's layout syntax, can be handled by adding a processing level between the lexer and the parser.

\input{LBNF}

\input{haskell}

\section{Java Code Generation}

But what about translating a grammar into an object-oriented language like Java? How can we translate a declarative grammar into an abstract syntax tree?

Appel outlines two possible approaches in his book \textit{Modern Compiler Implementation in Java} \cite{AppelJ}. 

In the first method, which Appel refers to as the ``Object-Oriented method,''
there is one Java class for each rule in the language grammar. Each class
inherits from a single superclass, and each class defines operations on itself.
For instance, if our compiler were to translate to SPARC and Intel assembly code
each class would have a method \texttt{toSPARC()} and \texttt{toIntel()} that would change that
class to the appropriate representation. The advantages of this method is that
it is quite easy to add new language categories. We simply add
some new classes and implement all the appropriate methods on this class. The
disadvantage is that it's hard to add new functions on the whole syntax tree.
Adding a function \texttt{toAlpha()} for instance, could result in editing hundreds of classes.

In the second, ``syntax separate from interpretations'' method there is still one Java class for each grammar rule, but now classes are simply empty data structures with no methods aside from a constructor. Translation functions are removed from the data structure, and traverse the tree in a method similar to the Visitor Design Pattern. The advantages to this method are that translations share a relatively isomorphic structure, and that these functions can make better use of context information than single objects' methods. The disadvantage is that adding new language constructs requires changing all existing traversal functions to handle the new cases.

However, the BNF Converter, which makes the grammar the central point of all language changes, lessens this disadvantage.  Additionally, since translation functions are now traversals, it is easy for our tool to generate skeleton functions as we do in Haskell and for the user to reuse the template in all transformations.\footnote{Of course, if the user implements a translation and then modifies the language definition they must still change the implemented code to reflect the modifications. However, they can refer to the template function in order to locate the differences.} Therefore the BNF Converter uses this method in generating Java (and C++) abstract syntax.

\subsection{Java Abstract Syntax Generation}

Let us return to our example of Boolean Expressions from earlier (Figure \ref{fig:source}). Given this grammar the BNF Converter will generate the abstract syntax found in Figure \ref{fig:java}A, following Appel's method.

There are several differences between this transformation and the Haskell version that should be highlighted. First, experienced Java programmers will quickly notice that all the generated classes are public, and in Java public classes must go into their own \texttt{.java} file (with class name matching the file name). Since it is not uncommon to have hundreds of productions in an LBNF grammar, the user's source directory can quickly become cluttered, so Abstract Syntax classes are placed into a sub-package called {\tt Absyn}, and thus must be kept in a filesystem subdirectory of the same name (which the tool creates).

There is a second difference in the code in Figure \ref{fig:java}A. Mainly: names. Classes in Java have instance variables and parameters, and all of these require unique names (whereas in Haskell data structures the names are only optional). First, we realize that parameter names generally don't matter - we can simply give them the name ``p'' plus a unique number. The names of instance variables on the other hand, do matter. The BNF Converter converts the type name to lowercase and add an underscore to prevent conflicts with reserved words. Then, if there is more than one variable of a type then they are numbered. Thus the classes EPlus and ETimes have members exp\_1 and exp\_2.

Notice that Appel's method uses public instance variables, which may be off-putting to object-oriented programmers today. Indeed, there is some question as to whether Appel today would use public members, or would use private members with accessors and mutators. We have chosen remain with the original method, both to keep a higher correspondence to the textbook, 
and to ease the generation of the Pretty Printer and other traversals.

Finally recall that Java does not yet support polymorphic lists. Generic types will be supported in the upcoming Java 2 Platform, Standard Edition 1.5 release. However, until the new methodology is widely adapted, the BNF Converter simply generates simple null-terminated linked lists for each list that the grammar makes use of. These special classes are prefixed with ``List'' is defined, such as the class ListEXP above, which takes the place of Haskell's [EXP].

\subsection{The Lexer and Parser}

The BNF Converter generates specification files for the JLex \cite{jlex} and CUP \cite{cup} tools which create a lexer and parser in a manner isomorphic to the Haskell version. The difference between the tools is mainly a matter of syntax. For example, CUP cannot work with strings directly but requires terminal symbols be defined for each language symbol or reserved word. Also, CUP does not refer to variables with \$ variables like Bison, but rather by assigning names to all possibly used values. A fragment of the specifications equivalent to the Alex and Happy code in Figure \ref{fig:haskell} is shown in Figures \ref{fig:java}B and \ref{fig:java}C.

\subsection{The Java Pretty Printer and Skeleton Function}
\label{javapp}

Similar to the Haskell version, the Java pretty printer linearizes the abstract syntax tree using some easily-modifiable heuristics. It follows the method Appel outlines, using Java's \texttt{instanceof} operator to determine which subclass it is dealing with, then down-casting and working with the public variables. For example, the code to pretty-print an EXP is found in Figure \ref{fig:java}D.

However, the pretty printer alone is not enough to test the correctness of a parse. In the Haskell version the built-in \texttt{show} function is used to print out the abstract syntax tree so that the programmer can confirm its correctness. We could use Java's \texttt{toString()} method for a similar role, but this is not satisfying as it is generally used for debugging purposes. Instead the BNF Converter adds a second method to the pretty printer, similar to Haskell's show function, shown in Figure \ref{fig:java}E.

Throughout both methods the generated code makes use of Java's \texttt{StringBuffer} class to efficiently build the result of the linearization.

Appel's \texttt{instanceof} method is also used to generate a code skeleton. However, this method may seem awkward to many object-oriented programmers, who are often taught to avoid \texttt{instanceof} wherever possible.

Much more familiar is the Visitor Design Pattern \cite{visitor}. In it each member of the abstract syntax tree implements an accept method, which then calls the appropriate method in the visiting class (double-dispatch).

There is no reason that these two methods cannot live side by side. Therefore the BNF Converter generates code skeletons using both Appel's method and a Visitor interface and skeleton (Figure \ref{fig:java}F).

Most familiar Visitor Design Patterns use a Visitee-traversal algorithm. That is to say, visiting the top member of a list will automatically visit all the members of the list. However, the BNF Converter-generated pattern uses Visitor-traversal. This means that it is the Visitor's responsibility, when visiting a list, to visit all the members in turn. This is because certain algorithms that compilers want to implement are not compositional, so performing a transformation on a single member may be quite different than performing that transformation on a certain pattern of nodes. For example, during peephole analysis a compiler may wish to merge to subsequent additions into a single operation, but may want to leave single additions unchanged. In our experience, these types of algorithms are easier to implement if the Visitor itself is in control of the traversal.

\subsection{The Test Bench and Makefile}

With the pretty printer defined it is trivial to define a test bench and Makefile to compile the code. However, the lack of an interactive environment such as Haskell's \texttt{hugs} means that the user is not able to specify which parser is used. Instead the first defined entry-point of the grammar is used by default. However it is quite easy for the user to specify another entry point in the source code, and then recompile the Test Bench.

\subsection{Translation Summary}

Our simple example language results in over 900 lines of Java code and specifications. Given these specifications JLex and CUP generate a further 737 lines of Java code, resulting in over 1600 lines of java code generated from eight lines of specification. However, we find that the main advantage to the Java programmer comes from not having to write dozens of tiny classes to represent their abstract syntax tree. A comparison of Figure \ref{fig:haskell}A and Figure \ref{fig:java}A emphasizes the challenges of implementing a compiler in Java.

Overall, translating from a declarative grammar to an object-oriented syntax tree is certainly possible, however the translation introduces a number of new complications such as the names of instance variables. The BNF Converter tries to solve these problems in a consistent, easy-to-understand way to ease the implementation of the rest of the compiler. Appel's syntax-separate-from-interpretations method introduces several conventions that object-oriented programmers may find confusing at first. However, in practice the ease of using the generated transformation templates should help users to quickly overcome these difficulties.

\input{java}

\section{C++ Code Generation}

It quickly became apparent that we could use the same methodology as the Java code generation to generate C++ code, using Flex \cite{flex} and Bison \cite{bison}. This translation was complicated by some small but significant differences between the two languages, including the additional burden of memory management and destructors, the separation of interface (.H) and implementation (.C or .cpp) files, and the lack of a predefined \texttt{StringBuffer} class such as in Java.

\subsection{The Abstract Syntax}

The abstract syntax tree data types generated in C++ (Figure \ref{fig:cpp}A) are entirely isomorphic to Appel's Java version. The primary difference lies in the fact that we can group all of the classes into one file which will not clutter the user's directory, and the separation of interface and implementation.

The use of C++'s \texttt{typedef} directive in this header file can significantly ease the burden of code generation by matching BNF Converter's names for types with the actual used versions. There is some danger that this  could result in confusion for the user, but in general giving a thing more than one name is not a problem.

The Implementation file trivially implements the methods the interface defines. A fragment is shown in Figure \ref{fig:cpp}B.

\subsection{The Lexer and Parser}
The BNF Converter also generates a lexer specification file for Flex and a parser specification file for Bison. These files are relatively isomorphic to their Haskell and Java counterparts.
Figures \ref{fig:cpp}C and \ref{fig:cpp}D show specifications code equivalent to the examples in Figures \ref{fig:haskell} and \ref{fig:java}.

One complication is that there is no way to access the result of the parse without storing a global pointer to it. This means that every potential entry point production must store a pointer to the parse (the \texttt{YY\_RESULT} variables in Figure \ref{fig:cpp}D), in case they are the final successful category. Users can limit the performance impact of this by using the \texttt{entrypoints} pragma.

Another complication arises from Bison (and YACC)'s \texttt{yylval} variable. This is a union type of all possible results for the \$\$ meta-variable which stores the result of a parse. So in order to set \$\$ as in the examples above, we need to include the types PROG, EXP, and ListEXP, as the \%token commands do above. The complication of this is that because the lexer sets \texttt{yylval}, the lexer needs to know about the existence of these types even though it never uses them. The BNF Converter solves this by generating a small header file \texttt{Parser.H} which forward declares all the necessary classes.

\subsection{The Pretty Printer and Skeleton Template}
Because C++ lacks an \texttt{instanceof} operator, the C++ Pretty Printer (Figure \ref{fig:cpp}E) and Skeleton templates use the Visitor Design Pattern. It also implements an Abstract Syntax Viewer, as defined in Section \ref{javapp}. These are all isomorphic to the Java Visitor Design Pattern skeleton described above. To store their result they implement a simple string buffer, which can grow but not shrink.

\subsection{The Test Bench and Makefile}

These behave in a similar way to the Java version.

\subsection{Translation Summary}

The separation of interface and implementation and the existence of destructors make C++ even more verbose than Java. Our example generates over 1100 lines of C++ code and specifications, and Flex and Bison generate an additional 2600 lines of code from these specifications.

Adding a C++ translation after the Java translation was implemented was quite straightforward. In the future, when support is added for generic types in Java, support should also be added for C++'s Standard Template Library.

\input{cpp}

\section{C Code Generation}

\subsection{The Abstract Syntax}

The translation to C code is quite different than the other languages. It follows the methodology used by Appel in the C Version of his textbook \cite{AppelC}.

In this methodology, each grammar category is represented by a C \texttt{struct}. Each struct has an enumerated type indicating which LBNF Label it represents, and a \texttt{union} of pointers to all corresponding non-terminal categories. So our boolean expressions example would generate the structs shown in Figure \ref{fig:c}A. Structs are originally named with an underscore, and \texttt{typdef} declarations clean up the code by making the original grammar name refer to a pointer to that struct.

Data structure instances are created by using constructor functions, which are generated for each struct (Figure \ref{fig:c}B). These functions are straightforward to generate and take the place of the \texttt{new} operator and object constructors in C++.

\subsection{The Lexer and Parser}

This work is virtually unchanged from the C++ version.

\subsection{The Pretty Printer and Case Skeleton}

Any algorithm that wishes to traverse the tree must switch on the \texttt{kind} field of each node, then recurse to the appropriate members. For example, Figure \ref{fig:c}C shows the Pretty-Printer traversal. The abstract syntax tree viewer and skeleton template work in an isomorphic way.

One small complication is that ANSI C does not allow two functions with the same name that have different type definitions. Therefore while the C++ version can use a function called \texttt{render} for both Strings and Chars, the C version of the pretty printer explicitly names the difference as \texttt{renderS} and \texttt{renderC}.

\subsection{The Test Bench and Makefile}

These are virtually unchanged from the C++ version.

\subsection{Translation Summary}

While it is straightforward to generate a parser and a data structure to represent the results of a parse in C, it seems that this methodology is the hardest for the programmer to use. The unfriendly combination of pointers and unions (seen in Figures \ref{fig:c}B and \ref{fig:c}C) result in confusion as to when the user should use an arrow operator (\texttt{-$>$}) and when they should use a period. We are currently looking into ways to make the generated code more user-friendly through the use of macros or other methods.

\input{c}


\section{Discussion}

\subsection{The BNF Converter as a teaching tool}

\label{results}

The BNF Converter was introduced as a teaching tool at the 
fourth-year compiler course in Spring 2003 at Chalmers.
The goal was, on the one hand, to advocate the use of declarative
and portable language definitions, and on the other hand, to
leave more time for back-end construction in a compiler course.
The results were encouraging: the lexer+parser part of the compiler 
was estimated only to be 25 \% of the work at the lowest grade, 
and 10 \% at the highest grade (at which the student compiler had to 
include several back ends). This was far from the times 
when the parser was more than 50 \% of a student compiler.

In Autumn 2003, BNFC is being used in a second-year Chalmers course
on Programming Languages. It is replacing the previously-used parser 
combinator libraries (in Haskell).
The main motivation at this level is to teach the correspondence
between parsers and grammars, and to
provide a high-level parser tool also for programmers who do not know Haskell.

One worry about using BNFC on the more advanced course
was that students would not really learn parsing, but just to write grammars. 
The answer to this is that students writing their parsers in YACC 
are equally isolated from the internals of LR parsing as 
those writing in LBNF. In fact, as learning
the formalism takes less time in the case of LBNF, 
the teacher can allocate more
time for explaining how the LR parser works. 



\subsection{Real-world languages}

Students in a compiler class usually implement toy languages.
What about real-world languages?
As an experiment, a complete LBNF definition of
ANSI C, with \cite{Kern88} as reference, was 
written\footnote{
BSc thesis of Ulf Persson at Chalmers.}. 
The result was a complete front-end processor for ANSI C, 
with the exception (mentioned in \cite{Kern88}) of type definitions,
which have to be treated with a preprocessor. The grammar has 229
LBNF rules and 15 token pragmas (to deal with different numeral
literals, such as octals and hexadecimals).

Another real-world example is the
object-oriented specification language OCL \cite{WarmerKleppe99}\footnote{
Work by Kristofer Johannisson at Chalmers.}. 
And of course, the BNF Converter itself is implemented by 
using modules generated from an LBNF grammar of LBNF (see the Appendix).


\input{prototyping}

\subsection{Related work}

The BNF Converter adds a level of abstraction to the YACC \cite{johnson-yacc} 
tradition of compiler compilers,
since it compiles a yet higher-level notation into
notations on the level of YACC.
Another system on this 
level up from YACC is Cactus \cite{Cactus},
which uses an EBNF-like notation to 
generate front ends in Haskell and C.
Unlike the BNF Converter, Cactus aims for completeness,
and the notation is therefore less simple than LBNF. 
It is not possible to extract a pretty printer from a Cactus grammar, 
and Cactus does not generate documentation.

The Zephyr definition language 
\cite{zephyr}\ defines a portable format for abstract syntax
and translates it into SML, Haskell, C, C++,
Java, and SGML, together with functions for displaying syntax trees.
But Zephyr does not support the definition of concrete syntax.

In general, compiler tools almost invariably opt for expressive power 
rather than declarativity and simplicity.
The situation is different in linguistics, where
the declarativity and \textit{reversibility} (i.e.\ usability for both
parsing and generation) of grammar formalisms are highly valued. A major 
example of this philosophy are Definite Clause Grammars (DCG) 
\cite{dcg}. Since DCGs are usually implemented as an embedded 
language in Prolog, 
features of full Prolog are sometimes smuggled into DCG grammars;
but this is usually considered harmful since it destroys
declarativity.



\section{Conclusions and future work}

BNF Converter is a tool implementing the Labelled BNF grammar formalism
(LBNF). Given that a programming language is
``well-behaved'', in a rather intuitive sense, an
LBNF grammar is the only source that is needed to implement
a front end for the language, together with matching
LaTeX documentation. Since LBNF is purely declarative, 
the implementation can be generated in different languages:
these currently include Haskell, Java, C++, and C, each with
their standard parser and lexer tools. Depending on
the tools, the size of the generated code is typically 
50--100 times the size of the LBNF source.

The approach has proven to be useful both in teaching and
in language prototyping. As for legacy real-world languages, 
complete definitions have so far been written for C and OCL.
Often a language is almost definable, but has some
exotic features that would require stronger tools.
We have, however, opted to keep LBNF simple, at the expense
of expressivity; and we believe that there are many good reasons
behind a trend toward
more and more well-behaved programming languages.

The LBNF formalism itself does not seem to require much
elaboration. As for the implementation, support for generic types in Java's upcoming release and C++ Standard Template Library should be added. Finally, the most frequent wish has been
to make it possible to retain position information
in the abstract syntax tree, so that error messages from later
compiler phases can be linked to the source code.



\bibliographystyle{plain}
\bibliography{CC-2004}

\input{lbnf_spec}

\end{document}
